Smartphones dependency risk analysis using machine-learning predictive models

Recent technological advances have changed how people interact, run businesses, learn, and use their free time. The advantages and facilities provided by electronic devices have played a major role. On the other hand, extensive use of such technology also has adverse effects on several aspects of human life (e.g., the development of sedentary lifestyles and new addictions). Smartphone dependency is a new addiction that primarily affects the young population. The consequences may negatively impact mental and physical health (e.g., lack of attention or local pain). Health professionals rely on self-reported subjective information to assess the dependency level, requiring specialists' opinions to diagnose such a dependency. This study proposes a data-driven prediction model for smartphone dependency based on machine learning techniques using an analytical retrospective case–control approach. Different classification methods were applied, including classical and modern machine learning models. Students from a private university in Cali, Colombia (n = 1228) completed (i) a smartphone dependency test, (ii) a musculoskeletal symptoms questionnaire, and (iii) the Risk Factors Questionnaire. Random forest, logistic regression, and support vector machine-based classifiers exhibited the highest prediction accuracy, 76–77%, for smartphone dependency, estimated through the stratified k-fold cross-validation technique. Results showed that self-reported information provides insight into predicting smartphone dependency correctly. Such an approach opens doors for future research aiming to include objective measures to increase accuracy and help reduce the negative consequences of this new form of addiction.

www.nature.com/scientificreports/

The volunteers signed an informed consent form before participating in the study. Individuals who submitted an incomplete form or who frequently practiced sports or artistic activities involving the upper limbs were excluded.
The Smartphone Dependency Test is a free-to-use test created by Chóliz 30 , which was validated and linguistically adapted in 2016 for students receiving both public and private education 31 . This test was used to measure the level of dependence on mobile devices (MD), which was assigned as the dependent variable. The test lasted 10 min and consisted of 22 items presented using a Likert-type scale. The scores range from 0 to a maximum of 88 and determine whether the dependency is absent (0-29), low (30-38), medium (39-48), or high (49-88). In addition, musculoskeletal disorders (MSD) were characterized via the Spanish version of the Nordic Questionnaire, whose application lasted 7 min. The questionnaire comprised two levels: (i) a general level that sought to determine the occurrence of musculoskeletal discomfort by anatomical region, and (ii) a specific level that focused on the chronology, frequency, duration, and intensity of the discomfort and its impact on everyday activities.
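As an illustration of the score bands of the Smartphone Dependency Test described above, the following minimal sketch maps a total score to a dependency level. The function names are hypothetical. Two assumptions are made: each of the 22 Likert items is scored from 0 to 4 (which is consistent with the stated maximum of 88), and the high band covers the remainder of the scale (49-88).

```python
def total_score(item_responses):
    """Sum the 22 item responses (each assumed to be scored 0-4)."""
    if len(item_responses) != 22:
        raise ValueError("expected 22 item responses")
    if any(not 0 <= r <= 4 for r in item_responses):
        raise ValueError("each response must be between 0 and 4")
    return sum(item_responses)


def dependency_level(score):
    """Map a total score (0-88) to the dependency band used in the text."""
    if not 0 <= score <= 88:
        raise ValueError("score must be between 0 and 88")
    if score <= 29:
        return "absent"
    if score <= 38:
        return "low"
    if score <= 48:
        return "medium"
    return "high"  # assumption: the high band is the rest of the scale (49-88)


# Example: a respondent answering 2 on every item scores 44 ("medium").
example_level = dependency_level(total_score([2] * 22))
```

How the four bands were collapsed into the binary target used by the classifiers is not specified here, so the sketch stops at the four-level category.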
The risk factors were the independent variables. The Risk Factors Questionnaire was designed by the researchers and subjected to internal validation through the Delphi method with a group of six experts, obtaining a validity of approximately 0.9 according to Cronbach's alpha; its application lasted 7 min. This questionnaire included the variables considered in the theoretical framework about sociodemographic, interpersonal, and contextual factors related to the device and physical load. This made it possible to identify the risk factors in the university student population 32 .
The study followed the principles of the Declaration of Helsinki, guaranteeing confidentiality through data coding and requiring signed informed consent before participation. Regarding data collection, the study protocol was reviewed twice and endorsed by the Scientific Committee of Ethics and Bioethics of the Universidad Santiago de Cali (act # 03 of 2019).
Data analysis. The data were recorded by double entry in Excel. The two databases were compared, and mismatched entries were corrected by checking them against the primary source.
To structure the model construction, the variables were transformed into categorical types for the processing and analysis phase. Missing data, which amounted to 1%, were imputed using the mode for qualitative variables and the arithmetic mean for quantitative variables. Once the information was validated, a descriptive exploratory analysis of the different variables was conducted to determine their behavior. Subsequently, a bivariate analysis was performed to determine which variables to include in the model; variables were selected on the basis of statistical significance (p-value < 0.05).
Figure 1 shows a schematic representation of the research approach. It depicts a general-purpose pattern-recognition system adapted to address the overuse of smartphones. First, participants answered three questionnaires (i.e., the Smartphone Dependency Test, the Nordic Questionnaire-Spanish version, and the Risk Factors Questionnaire) used by health professionals to assess the participant dependency level. Next, a selection strategy and descriptive exploratory analyses of the different variables were performed to determine which predictors were highly correlated with the target variable. As a result, 31 variables were selected and used to feed the data-driven predictive model. Two groups of algorithms were applied, i.e., the classical approach and the deep learning approach. The details of the algorithms are provided in the following section. Finally, based on these predictive models, smartphone dependency and overuse were estimated.
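The rule described above for filling the small share of missing values (mode for qualitative variables, arithmetic mean for quantitative ones) can be sketched as follows. This is only an illustration under stated assumptions: missing entries are represented by `None`, and the function name and example values are hypothetical rather than taken from the study data.

```python
from statistics import mean, mode


def impute(values, kind):
    """Replace None entries with the mode (qualitative) or the mean (quantitative)."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a variable with no observed values")
    if kind == "qualitative":
        fill = mode(observed)
    elif kind == "quantitative":
        fill = mean(observed)
    else:
        raise ValueError("kind must be 'qualitative' or 'quantitative'")
    return [fill if v is None else v for v in values]


# Hypothetical example columns, not study data.
posture = impute(["sitting", "sitting", None, "standing"], "qualitative")
hours = impute([2.0, None, 4.0], "quantitative")
```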
Data processing, debugging, modeling, and validation were structured in six stages and are described in Fig. 2.
Supervised machine learning techniques. Machine learning has been successfully used in several research areas with applications in medical signal processing, computer-assisted systems, language processing, and healthcare, among others. From classical methods to more recent deep learning techniques, data-driven models try to capture the inner structure of data derived from external systems. These models help make predictions on new unseen data 26,33,34 . Applications range across healthcare, transportation, social networks, banking, security, and education. Internet of Things (IoT) networks are widespread in many industrial applications, where machine learning models help identify and avoid malicious traffic attacks, which can affect network security and essential services 35-38 . These techniques have also been used to improve the user's experience and decision-making process, which are more subjective scenarios and more dependent on the user's psychological characteristics 39,40 . It is important to note that in such scenarios, it is necessary to analyze people's opinions, sentiments, perceptions, etc., to help develop tools that support users' interaction with applications, products, and services 40-42 . This is the possibility explored in this study, in which users are required to respond to a standardized self-report questionnaire that can be linked to smartphone dependency.
To have a precise notation, $x^{(i)}$ denotes the input variables, also known as features, arranged as an n-dimensional vector, while $y^{(i)}$ indicates the output or target variable (i.e., the variable to be predicted). The pair $(x^{(i)}, y^{(i)})$ is a training example. The dataset containing the information from m training examples, $\{(x^{(i)}, y^{(i)})\},\ i = 1, \dots, m$, is known as the training set. Typically, $X$ and $Y$ are used to denote the spaces of the input and output variables, respectively. When a classification problem is approached, the variables in the $Y$ space take discrete values corresponding to the classes or categories defined in the learning problem. For the specific problem addressed in this work, $y \in \{0, 1\}$, where $y = 0$ indicates a person with a negative diagnosis, whereas $y = 1$ indicates a person with a positive diagnosis of smartphone dependency.
A supervised learning problem estimates a function $h_\theta(x): X \to Y$, such that given an input $x$, $h_\theta(x)$ predicts the $y$ value. The function $h_\theta(x)$ is also known as the hypothesis function.
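To make the notion of a hypothesis function concrete, the sketch below writes one possible form of $h_\theta(x)$, namely the logistic-regression form, in which a weighted sum of the features is passed through a sigmoid and thresholded to obtain a label in {0, 1}. The parameter values, the 0.5 threshold, and the function name are illustrative assumptions, not fitted values from this study.

```python
import math


def hypothesis(theta, x, threshold=0.5):
    """Logistic-regression form of h_theta(x).

    theta[0] is the intercept and theta[1:] are the feature weights.
    Returns (estimated probability of y = 1, predicted label).
    """
    if len(theta) != len(x) + 1:
        raise ValueError("theta must have one more entry than x")
    z = theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))
    probability = 1.0 / (1.0 + math.exp(-z))
    return probability, int(probability >= threshold)


# Illustrative parameters and input (assumed values).
prob, label = hypothesis([-1.0, 0.5], [0.0])
```

Other classifiers define $h_\theta(x)$ differently (for example, through a margin in an SVM or a vote over trees in a random forest), but they share the same interface: features in, class label out.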
Several approaches have been applied to define the $h_\theta(x)$ function. These range from classical approaches, such as logistic regression 43 , support vector machines (SVM) with polynomial and radial basis function (RBF) kernels, which are considered a discriminative approach 44 , decision trees 45 , and random forests 46 , to modern approaches based on deep learning (DL), such as the multilayer perceptron (MLP) 33 and TabNet 47 , a model designed for tabular data such as those of the present study. A detailed description of these techniques is beyond the scope of this paper.
Deep learning techniques are well known for their performance when solving problems related to images, audio, and text 25,26 . One of the difficulties of training a deep learning model is having sufficient data for proper parameter estimation 26 . Some workarounds rely on transfer learning, where a previously trained model has its output modified to fit the new requirements and is then fine-tuned 25 . However, in this work, the amount of data was too limited to expect that a deep neural network would be adequately trained, nor are there pre-trained models for adjacent problems to which transfer learning could be applied. Hence, classical machine learning techniques are expected to be better suited to this task.
System validation. The assisted diagnosis process using automated systems is imperfect. The result obtained from a classification system represents a probability rather than a correct answer with irrefutable certainty. Different diagnostic measures are thus employed to verify and assure that the results are repeatable and to validate the ability of a system to identify the presence or absence of disease.
In particular, random cross-validation (tenfold) was used in these experiments. Of the available data, 70% were used for training and the remaining 30% to test the proposed model 33 . It is important to note that the folds were randomly assembled using a shuffle-split methodology in its stratified version to guarantee a proportional class distribution in each set 34 . Six classification approaches were evaluated: logistic regression, support vector machine, decision tree, random forest, multilayer perceptron, and TabNet. To assess the performance of each model, diagnostic measures such as sensitivity, specificity, accuracy, and precision were used.
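The validation scheme and the diagnostic measures named above can be sketched in plain Python. The first function draws one stratified 70/30 shuffle split, so that both sets keep the class proportions. The second computes sensitivity, specificity, accuracy, and precision from binary predictions. The function names, the seed handling, and the toy labels are assumptions made for illustration. This is not the authors' implementation, which may well have relied on a machine learning library.

```python
import random
from collections import defaultdict


def stratified_shuffle_split(labels, test_fraction=0.3, seed=0):
    """Return (train_idx, test_idx) with class proportions preserved in both sets."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


def diagnostic_measures(y_true, y_pred):
    """Sensitivity, specificity, accuracy, and precision (1 = dependency)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    # Note: a ratio is undefined (ZeroDivisionError) if its denominator is zero,
    # e.g. precision when no positive prediction is made.
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp),
    }


# Ten random stratified 70/30 splits of a toy label vector (illustrative only).
toy_labels = [1] * 70 + [0] * 30
splits = [stratified_shuffle_split(toy_labels, 0.3, seed=s) for s in range(10)]
```

In use, each classifier would be fitted on the training indices of a split and scored on the test indices, and the measures would then be averaged over the ten splits.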

Results
The data analyses indicated that 70% of the participants presented smartphone dependence. Initially, a preliminary analysis was conducted to identify the variables most strongly related to the response variable. To this end, the chi-square test for categorical variables and the odds ratio (OR) for dichotomous qualitative variables were applied. Table 1 shows the results of this analysis for each variable identified as related to smartphone dependency in students. The risk factors are presented, and the variables and their corresponding sub-categories are indicated. The frequency and percentage of students classified as having dependency (cases) are also shown.
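For readers unfamiliar with the screening step, the chi-square test mentioned above can be sketched for the simplest case of a 2x2 contingency table (a dichotomous variable against dependency status). The sketch uses Pearson's statistic without continuity correction and the closed-form p-value for one degree of freedom. Whether the original analysis applied a correction is not stated, and the counts below are made up for illustration.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test for the 2x2 table [[a, b], [c, d]].

    Returns (statistic, p_value) with one degree of freedom and
    no continuity correction.
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("every row and column must have a non-zero total")
    statistic = n * (a * d - b * c) ** 2 / denominator
    # For 1 degree of freedom, P(X > s) = erfc(sqrt(s / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Made-up counts: rows = exposed / unexposed, columns = cases / controls.
stat, p = chi_square_2x2(30, 10, 10, 30)
is_significant = p < 0.05  # selection threshold used in the text
```

Variables with more than two categories require the general r x c statistic and the chi-square distribution with (r - 1)(c - 1) degrees of freedom, which is outside the scope of this sketch.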
The responses associated with the identification of musculoskeletal discomfort indicated the wrist as the body area with the highest risk (OR = 1.93, 95% CI = 1.47-2.54). The neck, shoulder, back, and elbow regions showed similar risk (OR = 1.42, 1.62, 1.88, and 1.89, respectively). The results are summarized in Table 2. Table 3 shows the results for discomfort in the previous 12 months according to smartphone dependency. The elbow (OR = 1.45) and shoulder (OR = 1.69) showed the highest risk of discomfort, while the back showed the lowest.
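An odds ratio with its 95% confidence interval, as reported above, can be obtained from a 2x2 table of counts. The sketch below uses the standard log-based (Woolf) interval. The text does not state which interval method was used, so this choice is an assumption, and the counts are illustrative rather than study data.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Woolf (log-based) confidence interval.

    a: exposed cases, b: exposed controls,
    c: unexposed cases, d: unexposed controls.
    z = 1.96 corresponds to a 95% interval.
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four counts must be positive")
    odds_ratio = (a * d) / (b * c)
    standard_error = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * standard_error)
    upper = math.exp(math.log(odds_ratio) + z * standard_error)
    return odds_ratio, lower, upper


# Illustrative counts only.
or_value, ci_low, ci_high = odds_ratio_ci(20, 10, 10, 20)
```

An interval that excludes 1, such as the one reported for the wrist, indicates an association that is significant at the 5% level.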
Machine learning based prediction system. All the significant variables from the different analyses performed were included. A total of 31 variables related to smartphone dependence were identified. Table 4 shows the results for all classifiers in terms of five diagnostic measures: accuracy, specificity, sensitivity, precision, and area under the ROC curve. For the random forest, n_e is the number of estimators or trees in the forest. For the SVM, C is the regularization parameter, γ is the kernel coefficient for both polynomial and radial basis functions, and d is the degree of the polynomial kernel. In the case of the multilayer perceptron, we use a DNN with six hidden layers with 50, 50, 50, 20, 20, and 10 neurons using ReLU activation functions, connected to an output layer with a single neuron using a sigmoid activation function.
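The multilayer perceptron architecture described above can be illustrated with a small forward pass written in plain Python. The sketch assumes an input dimension of 31 (one input per selected variable; the actual dimension depends on how the categorical variables were encoded) and uses random, untrained weights, so its output has no diagnostic meaning. It only shows the layer layout, the activations, and the resulting number of trainable parameters.

```python
import math
import random

# Assumed input size of 31, six hidden layers, and one output neuron.
LAYER_SIZES = [31, 50, 50, 50, 20, 20, 10, 1]


def init_network(sizes, seed=0):
    """Create (weights, biases) per layer with small random weights (untrained)."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
        biases = [0.0] * n_out
        layers.append((weights, biases))
    return layers


def forward(layers, x):
    """ReLU on hidden layers, sigmoid on the single output neuron."""
    activation = list(x)
    last = len(layers) - 1
    for i, (weights, biases) in enumerate(layers):
        z = [sum(w * a for w, a in zip(row, activation)) + b
             for row, b in zip(weights, biases)]
        if i < last:
            activation = [max(0.0, v) for v in z]
        else:
            activation = [1.0 / (1.0 + math.exp(-v)) for v in z]
    return activation[0]


def count_parameters(sizes):
    """Number of weights plus biases in the fully connected network."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))


network = init_network(LAYER_SIZES)
output = forward(network, [1.0] * 31)      # a value strictly between 0 and 1
n_params = count_parameters(LAYER_SIZES)   # 8361 under the 31-input assumption
```

Under these assumptions the network has several thousand parameters to estimate from roughly a thousand samples, which illustrates why limited data can be a concern for the deep models.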
Differences were observed among the methods under study, considering the metrics to assess their performance. For example, the TabNet model and the decision tree have the lowest overall rates; however, the decision tree presented the highest specificity value, above 50%. In contrast, for logistic regression, random forest, and both support vector machine approaches, better sensitivity rates were achieved (above 91%), but specificity was significantly reduced (below 41%). As expected, neither the TabNet model nor the multilayer perceptron performed better than the classical approaches.
To perform a global evaluation for each classifier, the AUC of the ROC curve was determined (Fig. 3). It was observed that the classifier with the lowest performance was the TabNet model, followed by the decision tree. On the other hand, the similar AUC of the five models (AUC ~ 0.72) makes it challenging to determine which approach offers the best performance. Overall, considering the model's simplicity, the number of parameters, and the performance achieved by the logistic regression classification approach, such an approach is a suitable predictive model for the task at hand. However, the SVM or random forest classifiers constitute attractive alternatives, given that these approaches have comparable high performances.
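The AUC used for the global comparison above has a convenient interpretation: it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it directly from that definition. The function name and the example scores are illustrative assumptions and do not reproduce the values in Fig. 3.

```python
def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Illustrative scores: three of the four positive/negative pairs are ordered correctly.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

Because the AUC does not depend on a decision threshold, it complements the sensitivity and specificity figures, which are tied to one particular operating point.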
It is worth mentioning that a highly sensitive system can correctly identify participants where smartphone dependency is suspected. Hence, self-reported information gathered through standardized questionnaires contains discriminative features to train predictive models. However, the perceptual and subjective nature of the information can also hamper the potential of predictive models. This may be the reason for achieving low specificity. In the early stages of a diagnosis, it is helpful to include the assessment of multiple professionals to reject or confirm dependency. It would be necessary to include objective measurements to improve the system's prediction capabilities in future works.

Discussion
The classification models yielded satisfactory smartphone dependency predictions. Likewise, a relationship between smartphone dependency among university students and multiple risk factors was found, which should motivate establishing high-priority preventive actions. The results indicate that the academic program in which students were enrolled was significantly correlated with smartphone dependency, and a high prevalence was identified, especially among engineering (84.3%), health (77.8%), law (68.1%), and economic sciences students (50.0%). Similar results have been reported, although the highest dependency rate was identified in the medical academic program 50 . Marital status (72%) was related to smartphone dependency, which is in line with previous studies 51-54 . However, being single cannot be included as a risk factor. It can be hypothesized that being involved in a romantic relationship may reduce the time spent using a smartphone. Nevertheless, this is a factor that requires additional analysis.
A high socioeconomic stratum was also relevant for smartphone users, as it facilitates access to new technology, gadgets, pay-per-use applications, etc. 52,53,55 . Our data corroborate previous reports that students with a high family income are more likely to develop smartphone dependency 17 . In addition, young students may feel discriminated against for not having a cell phone and thus not satisfying a communication prerequisite to belong to a particular social group. Cellphone ownership is highly relevant in today's society, where social networks are at the core of personal and social relationships. This might have also accelerated the acquisition of the first cell phone, as dependency is more pronounced (74.1%) in those who used one for the first time more than six years ago. Others have also reported a similar dependence (77.5%) 56 . Further investigations are necessary to explore the causes of its acquisition and excessive use.
Adverse domestic situations can also be a predictor of smartphone dependency 57 . It has been shown that students who reported domestic conflict or adversities (e.g., parent alcohol and drug use, mental health problems, incarceration, suicide, intimate partner violence, separation/divorce, and homelessness) are also more likely to have problematic or addictive smartphone use. A strong association between household dysfunction and psychological and behavioral health issues was reported. However, further research is required to explain this association.
A significant difference was found between those who access the internet by paying for data packages and those with unlimited access. Having internet access with no limitations facilitates surfing the internet, making video calls, gaming, sending text messages at any time, etc. The results showed that having a data plan increases the probability of developing smartphone dependency by 50%, as the number of hours of use is also likely to be greater than for those with more limited access.
The amount of time spent using cell phones is also a strong indicator of dependence. In this study, the participants with smartphone addiction reported periods of usage longer than 6 h. It has been reported that the likelihood of developing smartphone addiction is proportional to the number of hours of use (3-4 h: OR = 5.79;   17 . Indeed, the risk almost doubled for those using the device for 5-6 h compared to those with fewer hours (i.e., 3-4 h per day) 58 .
Sitting was the most predominant posture while using a smartphone (66.3%), despite the short period for which it was sustained (i.e., less than an hour). This may explain why the wrist and the neck areas showed the largest prevalence (OR = 1.93 and 1.42, respectively). It has been reported that office workers with excessive smartphone use are approximately six times more likely to have neck pain 59 . This reinforces the view that smartphone dependency is highly associated with neck pain. Nonetheless, the prevalence was lower than that reported by Derakhshanrad and colleagues 59 . There can be multiple reasons for this difference, including the location, target population, and instrument applied. In this study, university students with smartphone dependency reported discomfort or musculoskeletal symptoms for less than one month (n = 532, 65.8%). Hence, the presence and duration of musculoskeletal discomfort in the last 12 months contribute to the prediction of smartphone dependency.
The variables used in the model show that sociodemographic characteristics determine a level of smartphone dependency. However, the age and gender variables must be ruled out. For instance, Nikhita and collaborators reported that female users had a higher prevalence 60 , while Matoza-Báez and colleagues 61 showed a higher prevalence among male users. The age of more than 90% of our participants ranged between 18 and 32, and a more comprehensive range is required to discard age as an explanatory factor. This is a cross-sectional analysis, and longitudinal studies are required before establishing a cause-effect relationship. The inclusion and analysis of variables related to academic performance, mental health, and sleep disorders may be considered for future studies. Although the number of participants included in the present study is not trivial, the amount of data affects the training process of the models, and it remains an open problem to address in future studies, including those based on deep learning techniques. Once risk factors and variables related to smartphone dependency are identified, strategies to reduce these risks and adverse effects become paramount for society. Such strategies should involve a multidisciplinary approach. Campaigns to raise awareness about the negative consequences for physical and mental health, how to address these problems, and where people can find professional advice may constitute a relevant strategy to counteract the adverse impacts of overusing technology.

Conclusions
Smartphones are ubiquitous and part of our daily life. The adverse effects of excessive use of smartphones are concerning, as dependency is becoming a public health problem requiring special attention due to its consequences on physical and mental health. Machine learning helped identify several dependency factors while using a large number of independent variables. The support vector machine and random forest presented the highest prediction precision for smartphone dependency, obtained through the stratified-k-fold cross-validation technique. The variable selection is more critical than the choice of a specific model itself.
This study shows that self-reported information obtained using standardized questionnaires contains discriminative information to predict smartphone dependency using data-driven models. These results open doors for future studies aiming to reduce the adverse effects of overusing mobile devices. In many cases, a correct assessment of dependency levels and the corrective actions to be taken require the intervention of experienced health professionals. This is not always possible in the early stages, while late interventions can be costly and may bring adverse effects. Further research in this area is still required, as the perceptual and subjective nature of the information may hamper the potential of predictive models. For future work, it is necessary to introduce objective measures. Using electronics to measure physiological activity can add important information instead of subjective self-reported variables.

Data availability
Datasets analyzed during the current research are available from the corresponding author upon reasonable request.